class: center, middle, inverse, title-slide .title[ # ISA 419: Data-Driven Security ] .subtitle[ ## 13: Clustering ] .author[ ###
Fadel M. Megahed, PhD
Endres Associate Professor
Farmer School of Business
Miami University
@FadelMegahed
fmegahed
fmegahed@miamioh.edu
Automated Scheduler for Office Hours
]

.date[
### Spring 2024
]

---

## Quick Refresher of Last Class

✅ Explain the difference between supervised and unsupervised learning.

✅ Understand the importance of different preprocessing steps and when they should be used in ML.

✅ Explain typical error measures used for supervised and unsupervised learning tasks.

---

## Learning Objectives for Today's Class

- Define clustering

- Explain the `\(k-\)`means clustering algorithm for numeric datasets

- Implement the `\(k-\)`means algorithm in Python using the `pycaret` package

- Describe scenarios where other/advanced clustering algorithms are needed

---

class: inverse, center, middle

# What is Clustering?

---

## The Problem of Clustering

- Given a **set of (high-dimensional) observations**, with a notion of **distance** between observations, **group the observations** into **some number of clusters**, so that:
  + Members of a cluster are close/similar to each other
  + Members of different clusters are dissimilar

- **Usually:**
  + The observations are in a high-dimensional space
  + Similarity is defined using a distance measure, e.g.,
    * Euclidean, Cosine, Jaccard, edit distance, etc.

.footnote[
<html>
<hr>
</html>

**Source:** J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, <http://www.mmds.org>
]

---

## Clustering in 2D Space: A Fun Example

.pull-left[
.center[
**Meet the Palmer penguins**

[](https://allisonhorst.github.io/palmerpenguins/)
]
]

.pull-right[
.center[
**Anatomical description of the dataset:**

[](https://allisonhorst.github.io/palmerpenguins/)
]
]

.footnote[
<html>
<hr>
</html>

**Source:** The data are available by [CC-0 license](https://creativecommons.org/share-your-work/public-domain/cc0/) in accordance with the [Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data](http://pal.lternet.edu/data/policies).
The artwork is by Allison Horst and available at <https://allisonhorst.github.io/palmerpenguins/>, and the data is downloaded using the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/)
package. ] --- ## Clustering in 2D Space: Formulation - Given a **set of observations (each containing bill length and depth)**, with a notion of **Euclidean distance** between observations, **group the observations** into **3 clusters**, so that: + Members of a cluster are close/similar to each other + Members of different clusters are dissimilar - Note we are assuming that we did not have a "label/type" for each penguin. .footnote[ <html> <hr> </html> **Source:** The data are available by [CC-0 license](https://creativecommons.org/share-your-work/public-domain/cc0/) in accordance with the [Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data](http://pal.lternet.edu/data/policies). The artwork is by Allison Horst and available at <https://allisonhorst.github.io/palmerpenguins/>, and the data is downloaded using the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/)
package. ] --- ## Clustering in 2D Space: Raw Data <img src="data:image/png;base64,#13_clustering_files/figure-html/bill_length_depth1-1.png" style="display: block; margin: auto;" /> .footnote[ <html> <hr> </html> **Source:** The data are available by [CC-0 license](https://creativecommons.org/share-your-work/public-domain/cc0/) in accordance with the [Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data](http://pal.lternet.edu/data/policies). The artwork is by Allison Horst and available at <https://allisonhorst.github.io/palmerpenguins/>, and the data is downloaded using the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/)
package. ] --- ## Clustering in 2D Space: Labeled Raw Data <img src="data:image/png;base64,#13_clustering_files/figure-html/bill_length_depth2-1.png" style="display: block; margin: auto;" /> .footnote[ <html> <hr> </html> **Source:** The data are available by [CC-0 license](https://creativecommons.org/share-your-work/public-domain/cc0/) in accordance with the [Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data](http://pal.lternet.edu/data/policies). The artwork is by Allison Horst and available at <https://allisonhorst.github.io/palmerpenguins/>, and the data is downloaded using the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/)
package. ] --- ## Clustering in 2D Space: Clustering Results <img src="data:image/png;base64,#13_clustering_files/figure-html/bill_length_depth3-1.png" style="display: block; margin: auto;" /> .footnote[ <html> <hr> </html> **Source:** The data are available by [CC-0 license](https://creativecommons.org/share-your-work/public-domain/cc0/) in accordance with the [Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data](http://pal.lternet.edu/data/policies). The artwork is by Allison Horst and available at <https://allisonhorst.github.io/palmerpenguins/>, and the data is downloaded using the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/)
package.
]

---

## Comments on the 2D Clustering Problem

Even though the 2D clustering problem is the easiest to "solve," since we can benefit from plotting the data, **clustering is hard**.

**Some important questions:**

1. With all the variables being numerical, we often assume **Euclidean distance**. This can be problematic when:
  - variables have significantly different scales
  - we are including information that is not pertinent to grouping

2. How do you determine the number of clusters (*k*)?

3. How do we represent a cluster of many points?

4. How do we determine the "nearness" of clusters?

---

## An Overview of Clustering Methods

[](https://www.computer.org/csdl/journal/ec/2014/03/06832486/13rRUEgs2xB)

.footnote[
<html>
<hr>
</html>

**Source:** A. Fahad, et al., "A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis," in IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267-279, 2014. <https://doi.org/10.1109/TETC.2014.2330519>
]

---

class: inverse, center, middle

# `\(k-\)`Means Clustering

---

## The General Idea

The `\(k\)`-means algorithm clusters data by trying to separate samples into `\(k\)` groups of equal variance, minimizing a criterion known as the **inertia** or **within-cluster sum-of-squares** (see below). This algorithm requires the **number of clusters to be specified**.

.center[
`\(\sum_{i=0}^{n}\min_{\mu_j \in C}(||x_i - \mu_j||^2)\)`
]

**Inertia is a measure of how internally coherent clusters are; however, it suffers from various drawbacks:**

- Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters or manifolds with irregular shapes.

- Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very high-dimensional spaces, Euclidean distances tend to become inflated.
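The inertia criterion can also be checked numerically: scikit-learn's `KMeans` exposes it as the fitted attribute `inertia_`, and it matches the within-cluster sum-of-squares computed by hand. A minimal sketch, using a small synthetic two-blob dataset made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic data: two well-separated 2D blobs (illustrative only)
rng = np.random.default_rng(2024)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=2024).fit(X)

# km.inertia_ is the criterion above: the sum of squared distances
# from each sample to its assigned cluster center
by_hand = sum(np.sum((X[km.labels_ == j] - km.cluster_centers_[j]) ** 2)
              for j in range(2))
```

Comparing `inertia_` across several candidate values of `\(k\)` is also the basis of the elbow heuristic discussed later.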
.footnote[
<html>
<hr>
</html>

**Source:** Clustering — scikit-learn 1.0.2 documentation <https://scikit-learn.org/stable/modules/clustering.html#k-means>
]

---

## The Steps of the `\(k\)`-means Algorithm

In basic terms, the algorithm has three steps.

0. Step 0 chooses the initial centroids, with the most basic method being to choose `\(k\)` samples from the dataset `\(X\)`. After initialization, `\(k\)`-means consists of looping between the remaining two steps.

1. Step 1 assigns each sample to its nearest centroid.

2. Step 2 creates new centroids by taking the mean value of all of the samples assigned to each previous centroid. The difference between the old and the new centroids is computed.

**The algorithm repeats these last two steps until the centroids no longer move significantly.**

.footnote[
<html>
<hr>
</html>

**Source:** Clustering — scikit-learn 1.0.2 documentation <https://scikit-learn.org/stable/modules/clustering.html#k-means>
]

---

## The `\(k\)`-means Algorithm: A Visual Example

<img src="data:image/png;base64,#https://dashee87.github.io/images/kmeans.gif" width="100%" style="display: block; margin: auto;" />

.footnote[
<html>
<hr>
</html>

**Source:** David Sheehan. (2018). [Clustering with Scikit with GIFs](https://dashee87.github.io/data%20science/general/Clustering-with-Scikit-with-GIFs/)
]

---

## A Demonstration of `\(k\)`-means Clustering

Use the `\(k\)`-means algorithm to cluster the following observations. Use `\(k=2\)` and Euclidean distance.
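The algorithm's steps can also be sketched from scratch in a few lines of plain `numpy`, which you can use to check your hand computation. The data are the seven observations from the table below; seeding with observations 1 and 4 is an arbitrary illustrative choice (`\(k\)`-means results generally depend on initialization):

```python
import numpy as np

# the seven observations from the table below (X1, X2)
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

def kmeans(X, k, init_idx, max_iter=100, tol=1e-9):
    """Bare-bones k-means; assumes no cluster ever becomes empty."""
    centroids = X[init_idx].astype(float)             # Step 0: initial centroids
    for _ in range(max_iter):
        # Step 1: assign each sample to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: new centroids are the means of the assigned samples
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.linalg.norm(new_centroids - centroids) < tol:  # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

labels, centroids = kmeans(X, k=2, init_idx=[0, 3])   # seed with observations 1 and 4
```

With this seeding, the algorithm converges in a few iterations: observations 1-2 end up in one cluster and observations 3-7 in the other.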
**Use [this handout](https://miamioh.instructure.com/courses/209406/files/31265757?module_item_id=5085021) to go through the `\(k\)`-means algorithm implementation (by hand).** <style type="text/css"> .tg {border-collapse:collapse;border-spacing:0;} .tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; overflow:hidden;padding:10px 5px;word-break:normal;} .tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;} .tg .tg-baqh{text-align:center;vertical-align:top} .tg .tg-amwm{font-weight:bold;text-align:center;vertical-align:top} .tg .tg-0lax{text-align:left;vertical-align:top} </style> <table class="tg"> <thead> <tr> <th class="tg-amwm">Observation</th> <th class="tg-amwm">X1</th> <th class="tg-amwm">X2</th> </tr> </thead> <tbody> <tr> <td class="tg-baqh">1</td> <td class="tg-0lax">1.0</td> <td class="tg-0lax">1.0</td> </tr> <tr> <td class="tg-baqh">2</td> <td class="tg-0lax">1.5</td> <td class="tg-0lax">2.0</td> </tr> <tr> <td class="tg-baqh">3</td> <td class="tg-0lax">3.0</td> <td class="tg-0lax">4.0</td> </tr> <tr> <td class="tg-baqh">4</td> <td class="tg-0lax">5.0</td> <td class="tg-0lax">7.0</td> </tr> <tr> <td class="tg-baqh">5</td> <td class="tg-0lax">3.5</td> <td class="tg-0lax">5.0</td> </tr> <tr> <td class="tg-baqh">6</td> <td class="tg-0lax">4.5</td> <td class="tg-0lax">5.0</td> </tr> <tr> <td class="tg-baqh">7</td> <td class="tg-0lax">3.5</td> <td class="tg-0lax">4.5</td> </tr> </tbody> </table> --- # Practical Issues with `\(k\)`-means Clustering .panelset[ .panel[.panel-name[Data] ``` ## # A tibble: 344 × 8 ## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## <fct> <fct> <dbl> <dbl> <int> <int> ## 1 Adelie Torgersen 39.1 18.7 181 3750 ## 2 Adelie Torgersen 39.5 17.4 186 3800 ## 3 Adelie Torgersen 40.3 18 195 3250 ## 4 Adelie Torgersen NA NA NA NA ## 5 
Adelie Torgersen 36.7 19.3 193 3450 ## 6 Adelie Torgersen 39.3 20.6 190 3650 ## 7 Adelie Torgersen 38.9 17.8 181 3625 ## 8 Adelie Torgersen 39.2 19.6 195 4675 ## 9 Adelie Torgersen 34.1 18.1 193 3475 ## 10 Adelie Torgersen 42 20.2 190 4250 ## # ℹ 334 more rows ## # ℹ 2 more variables: sex <fct>, year <int> ``` ] .panel[.panel-name[Prep] ``` ## # A tibble: 342 × 5 ## species bill_length_mm[,1] bill_depth_mm[,1] flipper_length_mm[,1] ## <fct> <dbl> <dbl> <dbl> ## 1 Adelie -0.883 0.784 -1.42 ## 2 Adelie -0.810 0.126 -1.06 ## 3 Adelie -0.663 0.430 -0.421 ## 4 Adelie -1.32 1.09 -0.563 ## 5 Adelie -0.847 1.75 -0.776 ## 6 Adelie -0.920 0.329 -1.42 ## 7 Adelie -0.865 1.24 -0.421 ## 8 Adelie -1.80 0.480 -0.563 ## 9 Adelie -0.352 1.54 -0.776 ## 10 Adelie -1.12 -0.0259 -1.06 ## # ℹ 332 more rows ## # ℹ 1 more variable: body_mass_g <dbl[,1]> ``` ] .panel[.panel-name[k-means (k=3)] ``` ## ## 1 2 3 ## Adelie 0 0 151 ## Chinstrap 0 1 67 ## Gentoo 66 57 0 ``` ] .panel[.panel-name[Optimal `\(k\)`] ``` ## *** : The Hubert index is a graphical method of determining the number of clusters. ## In the plot of Hubert index, we seek a significant knee that corresponds to a ## significant increase of the value of the measure i.e the significant peak in Hubert ## index second differences plot. ## ``` ``` ## *** : The D index is a graphical method of determining the number of clusters. ## In the plot of D index, we seek a significant knee (the significant peak in Dindex ## second differences plot) that corresponds to a significant increase of the value of ## the measure. 
## ## ******************************************************************* ## * Among all indices: ## * 8 proposed 2 as the best number of clusters ## * 11 proposed 3 as the best number of clusters ## * 1 proposed 4 as the best number of clusters ## * 3 proposed 5 as the best number of clusters ## * 1 proposed 10 as the best number of clusters ## ## ***** Conclusion ***** ## ## * According to the majority rule, the best number of clusters is 3 ## ## ## ******************************************************************* ``` ``` ## ## 1 2 3 ## Adelie 8 0 143 ## Chinstrap 63 0 5 ## Gentoo 0 123 0 ``` ] .panel[.panel-name[Viz Clusters] <img src="data:image/png;base64,#13_clustering_files/figure-html/penguins6-1.png" style="display: block; margin: auto;" /> ] ] --- ## Summary of Practical Issues - Rescale numeric data prior to `\(k\)`-means implementation. The scaling can be: + a z-transformation similar to what we did in the example + a 0-1 scaling + converting count data into percentage or counts per a certain number of the population + etc. - Use more than one metric to determine `\(k\)` when using `\(k\)`-means clustering - Your cluster solution is not the end result, you will need to: + visualize it in appropriate way (simple representation as in the previous slide, [spatially](https://fmegahed.github.io/covid_analysis_final.html#33_Visualizing_the_Clustering_Results), [time-based](https://fmegahed.github.io/isa401/class23/23_data_mining_overview.html?panelset1=calendar-plot-of-clustered-data&panelset4=data3&panelset5=data4&panelset6=activity3&panelset7=activity4#10), etc.) 
  + attempt to explain the cluster membership using an appropriate binomial/multinomial model (e.g., see [this analysis](https://fmegahed.github.io/covid_analysis_final.html#4_Explanatory_Modeling_of_Cluster_Assignments))

---

class: inverse, center, middle

# Clustering Implementation in Python

---

## The `pycaret` Package

```python
import pandas as pd
from pycaret.clustering import *
import janitor  # registers the clean_names() method on DataFrames

# read the PortMap flow data, sample 1,000 rows, and keep six numeric
# features (plus the label, which is withheld from the clustering)
portmap_df = (
  pd.read_csv('https://raw.githubusercontent.com/fmegahed/isa419/main/data/portmap.csv')
  .sample(n = 1000, random_state = 123)
  .clean_names()
  .dropna()
  .reset_index(drop = True)
  .loc[:, ['_flow_duration', '_total_fwd_packets', '_total_backward_packets',
           'total_length_of_fwd_packets', '_total_length_of_bwd_packets',
           '_fwd_packet_length_max', '_label']]
)

# set up the experiment: z-score normalization, label ignored
s = setup(
  portmap_df, session_id = 2024,
  ignore_features = ['_label'],
  preprocess = True, normalize = True, normalize_method = 'zscore')

kmeans = create_model('kmeans', num_clusters = 2)    # fit k-means with k = 2
kmeans_results = assign_model(kmeans)                # append cluster labels

plot_model(kmeans, plot = 'elbow')                   # elbow plot for choosing k
plot_model(kmeans, plot = 'cluster', feature = '_label')  # 2D cluster plot
```

---

class: inverse, center, middle

# Other Clustering Methods

---

## Other Clustering Methods

- **Hierarchical Clustering**: This method does not require the number of clusters to be specified; it instead seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
  + **Agglomerative**: This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  + **Divisive**: This is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
- **DBSCAN**: This is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions.

---

## Other Clustering Methods

- **Memory-based Clustering**: This is a family of clustering algorithms that do not require the entire dataset to be held in memory. Examples include:
  + **BIRCH**: This is a hierarchical algorithm that summarizes the data into a compact clustering-feature (CF) tree built in a single pass, which makes it suitable for very large datasets.
  + **CLARANS**: This is a `\(k\)`-medoids-style algorithm that uses a randomized search over candidate medoids, which allows it to scale to very large datasets.

---

class: inverse, center, middle

# Recap

---

## Summary of Main Points

By now, you should be able to do the following:

- Define clustering

- Explain the `\(k-\)`means clustering algorithm for numeric datasets

- Implement the `\(k-\)`means algorithm in Python using the `pycaret` package

- Describe scenarios where other/advanced clustering algorithms are needed

---

## 📝 Review and Clarification 📝

1. **Class Notes**: Take some time to revisit your class notes for key insights and concepts.

2. **Zoom Recording**: The recording of today's class will be made available on Canvas approximately 3-4 hours after the end of class.

3. **Questions**: Please don't hesitate to ask for clarification on any topics discussed in class. It's crucial not to let questions accumulate.